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Abstract 

We extend recent methods for parametric sequence alignment to the parameter space for 
scoring RNA folds. This involves the construction of an RNA polytope. A vertex of this poly- 
tope corresponds to RNA secondary structures with common branching. We use this polytope 
and its normal fan to study the effect of varying three parameters in the free energy model that 
are not determined experimentally. Our results indicate that variation of these specific param- 
eters does not have a dramatic effect on the structures predicted by the free energy model. We 
additionally map a collection of known RNA secondary structures to the RNA polytope. 

Keywords: RNA secondary structure; plane tree; free energy; thermodynamic model; para- 
metric analysis 

1 Introduction 

Determining the structure of RNA molecules remains a fundamental scientific challenge, since 
current methods cannot always identify the "correct" fold from the large number of possible con- 
figurations. A common method for predicting the secondary structure of a single RNA molecule, 
termed the thermodynamic model, involves free energy minimization |2H l32| IM] . Extensions to 
this approach, such as suboptimal structure prediction and partition function calculations, still 
depend on the parameters from the thermodynamic model to score possible secondary structures. 
The free energy of a secondary structure is calculated by scoring substructures according to a set 
of parameters — most of which are determined experimentally (see [16] for a review). A dynamic 
programming algorithm, used in software packages like mfold |20|, [33] . computes the minimal free 
energy as well as the optimal secondary structure(s) [T9] . 

In this work, we address variation in the parameter space for scoring secondary structures, 
focusing on three parameters from the multi-branch loop energy function that are not based on 
measurement. Specifically, we address the following questions. What is the geometry of the pa- 
rameter space for scoring RNA folds and how does this geometry relate back to the biology? How 
sensitive is the thermodynamic model to variation of the ad-hoc multi-branch loop parameters? 
We answer these questions using geometric combinatorics. We find that variation of the multi- 
branch loop parameters has a smaller effect than the change in the parameter space coming from 
improved measurement. Moreover, regardless of the choice of multi-branch loop parameters used in 
the current version of the thermodynamic model, the minimal energy structures have a low degree 
of branching. 

Our results are achieved by applying techniques from geometric combinatorics to give a para- 
metric analysis of RNA folding. We construct an RNA polytope whose vertices correspond to sets 
of secondary structures with common branching. Its normal fan subdivides the parameter space so 
that the parameters lying in the same cone give the same minimal free energy structures. These 
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approaches have been used recently in parametric sequence alignment [7J |HJ [22] and for more gen- 
eral hidden Markov models [U [23] . There is also earlier polyhedral work on parametric sequence 
alignment |13|I30| and related work on secondary structure comparison [29] and sequence/structure 
alignment [IB]. We additionally make comparisons with biological structures, and this work sup- 
ports our theoretical results. 



2 Background 

2.1 Plane trees and RNA folding 

We use a simplified model of RNA folding in which a secondary structure S is represented by a 
rooted plane tree T = T(S). Single-stranded RNA sequences fold into molecular structures. One 
step in this folding process is the formation of Watson-Crick and also G-U base pairs. The set 
of (nested) base pairs determines the secondary structure of an RNA sequence. As illustrated in 
Figure [T] a secondary structure has two basic types of substructures — runs of stacked base pairs 
which are called helices and the single-stranded regions known as loops. Every component of a 
secondary structure is given an associated free energy score by the thermodynamic model. To a 
first approximation, the score of a loop is determined its degree — the number of base pairs contained 
in the loop. There are different energy functions for the external loop, hairpin loops, which have 
degree 1, bulge/internal loops with degree 2, and multi-branch loops with degree greater than 2. 
Suppose L is multi- branch loop, then the free energy of L is 

E(L) = a + bni + cn 2 + q, (1) 

where n\ is the number of single-stranded bases in L, ri2 is the number of helices in L, q is the 
sum of the single-base stacking energies in L, and a,b,c are the parameters for offset, free base, 
and helix penalties, respectively [34 . In this work, our analysis is primarily focused on the three 
parameters a, b, and c from this function since they are not experimentally determined. Our results 
are obtained by considering rooted plane trees as a simplified model of RNA folding. 

Plane trees have been used to enumerate possible RNA secondary structures [24J and also 
to compare them |17l 125] for some time now. The interaction between combinatorics and RNA 
folding has continued to develop over the last 20 years, including using trees as more abstract 
representations of RNA folding, for instance in [11 j and related work as well as in j2j [3J H3J- A 
rooted plane tree (also called plane tree or ordered tree [51 [27]) is a tree with a specified root 
vertex and such that the subtrees of any given vertex are ordered. This ordering comes from the 
5' — * 3' linear arrangement of the RNA sequence. Plane trees with n edges are one of the many 
combinatorial objects counted by the Catalan numbers 

C »=4l( 2 "Y (2) 
n + 1 \ n J 

To obtain T, we assign the root vertex to the exterior loop of S and the non-root vertices of T 
correspond to the remaining loops in S. Two vertices in T share an edge when their loops in S are 
connected by helices. As an example, we give a secondary structure in Figure [T]^ together with its 
associated plane tree in Figure[Tp. Technically, a secondary structure S must be free of pseudoknots 
in order to construct T. While pseudoknots do occur in secondary structures, the thermodynamic 
model cannot predict them and moreover one can create a nested, pseudoknot-free structure from 
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a given fold in several ways — some of which are in [26] and our approach is described the Materials 
and Methods section. Given a plane tree T with n edges, we write r for the degree of the root 




Figure 1: Secondary structures as rooted plane trees 



vertex and for < k < n, is the number of non-root vertices with k children. Thus, d^ gives 
the number of non-root vertices in T with degree k + 1, and this is the number of loops in S with 
k + 1 branches. To assign an energy to a plane tree, we assign weights to the vertices, based on the 
down degree of the vertex. In terms of secondary structures, we are assigning the same energy to 
each type of loop in the fold. This is a simplification of the scoring for the thermodynamic model, 
in which the energy of a structure is the sum of the energies of the loops. If T is a plane free with 
n edges, the free energy of T is written as 

n 

E{T) = a 3 r + a d + a\d\ + ^[c 2 + a 2 (k + l)]dk 

k=2 

n n 

= a 3 r + a d + a\d\ + (c 2 + 2a 2 ) d k + a 2 /^k - l)d k 

k=2 k=2 

= a 3 r + a d + a\d\ + (c 2 + 2a 2 )(n - do — d{) + a 2 {d - r) 

= (c 2 + 2o 2 )n + (a 3 - a 2 )r + (a - c 2 - a 2 )c?o + («i — c 2 — 2a 2 )<ii, 

where we have used the relations 

n n 

dk = n — do — di and ^^(^ — l)^/c = do — r 

k=2 k=2 

that hold for all plane trees (27] . To minimize free energy, we must minimize E(T) over the space of 
all plane trees. Since this space is infinite, we will typically think of n as being fixed but arbitrary 
and minimize the free energy function over the finite space of plane trees with n edges. For a given 
set of parameters ao, a±, a 2 , a 3 , c 2 , this is equivalent to minimizing the following inner product 

E'(T) = (9 2 ,e 3 ,e 4 )-(r,d ,d 1 ) (3) 

where 

#2 = as — a 2 9s = a — c 2 — a 2 # 4 = a\ — c 2 — 2o 2 . 
2.2 Geometric combinatorics 

In this section, we present some basic definitions in geometric combinatorics. We refer the reader 
to j!2[ [3~T] for a more detailed treatment. A set U C M. d is convex if for any two points x,y G U, 
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the line segment connecting x and y is contained in U, that is {ax + (1 — a)y | < a < 1} C U. 
For any subset U of M. d , the convex hull of U, written convll, is the intersection of all convex sets 
that contain U. A lattice polytope A C M. d is the convex hull of a finite collection of lattice points: 
A = conv.A, where A = {yi,y2,V3, • • ■ , 2/r} C Z d . 

Any lattice polytope A is characterized by a finite collection of defining inequalities 



{ci ■ x > 6j}ig/ where q G Z , x G A, and 6j G Z. 
A /ace F of A is a subset defined by setting some of the defining inequalities to equality, i.e. 



(4) 



F = < 



x G A 



'Ik 



and the dimension of F is the dimension of its affine span. The vertices of A are the O-dimensional 
faces while the facets have dimension dimA — 1. The boundary of A, written dA is the union of 
all faces of A of dimensions 0, 1, 2, • • • , dimA — 1. 

A convex polyhedral cone a is the positive hull of a finite collection of lattice points in Z rf : 
a = \t\Z\ + t2Z2 + • • • + t s z s | ti > 0, Zi £ Z d }, and we write a = (z%,Z2, ■ ■ ■ z s ). Associated to each 
lattice polytope A is its normal fan A/"(A) that is a collection of cones and subdivides M d . The rays 
(1-dimensional cones) of jV(A) are of the form (cj) for i G / in Q . Moreover, the cones a G A/"(A) 
are in one-to-one correspondence with faces F of A: 



op = { v G M ' | u ■ v < x ■ v Mu G F, Mx G A}. 



(5) 



Note that dimerp = dimA — dimF. In terms of minimization, equation ^ states that the points 
in F are minimizers of the dot product for vectors in ctf, among all points in A. As an example 
of the above concepts, we give a 2-dimensional polytope A in Figure [2]A. and its normal fan Af(A) 
in Figure |2j3. The four vertices of A correspond to the four 2-dimensional cones in jV(A), and the 
four facets of A correspond to the four rays of A/"(A). 



A) 




Figure 2: A 2-dimensional polytope A (A) and its normal fan M(A) (B) 
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3 Results 



3.1 Plane trees that minimize energy 

Fixing n > 5, the possible count vectors (r, do,d\) of plane trees are classified by the second author 
|14j and fall into one of four classes, as listed in Table [I] with r,do,d\ > in all cases. Since r, do, d\ 

Set of inequalities Vertices for n even Vertices for n odd 

r = 1 

(A) do = l {(1,1, n-1)} {(1,1, n-1)} 

di = n — 1 



(B) 



r = 1 

2<d <n /(l,2,n-3),(l,^,0),\ / (1, 2, n - 3), (1, 0), 



n-2d + l<d! \ (l,f,l),(l,n- 1,0) J \ (l,n-l,0) 

c?i < n — do — 1 



r = d 

(C) 2<d <n {(2,2,n-2),(n,n,0)} {(2, 2, n - 2), (n, n, 0)} 
<ii = n — do 

2 < r 

r < 2( i _ n + dl r(2,n-l,0),(n-2,n-l,0),l 

(D) 3 < do < n-1 { ( 7 2 3 1 : Mw 2 i \'° )l } (2,^,0), (3,^,0), 
n _ 2( i + 2<d 1 ^ 4j ' i2 ' 2 > U J J [ (2,3,n-4),(2,S±l,l) 
di < n — do — 1 



Table I: Sets of inequalities and corresponding vertices for plane trees 

must all be integers, the vertices in Table [1^ or Table |Ip differ depending on whether or not n is 
even or odd. We want to minimize the linear energy function over this point set (which includes 
count vectors from all four cases), and hence we let P n be the convex hull of the union of the four 
polytopes listed in Table [TJ Regardless of our choice of energy parameters, a minimum energy plane 
tree with n edges will occur at a vertex of P n . The following proposition describes the vertices of 

Pn- 

Proposition 3.1.1. Define as follows. 

_ J conv{(l, n±i,o),(l,n- l,0),(l,l,n- l),(n,n,0)} n odd 

n \ conv{(l,^,0),(l,f,l),(2,^±2,0),(l,n-l,0),(l,l,n-l),(n,n,0)} n even 

Then ^f n = P n for n > 5. 

Proof. Clearly \& n C P n and hence we'll show each lattice point of P n in Table [I] is contained in 
fy n . The normal fan of ^f n has rays 

{(-l,2,l),(l,0,0),(l,l-n,2-n),(0,0,l)} n odd 

{(-l,2,l),(l,0,0),(l,l-n,2-n),(0,0,l),(0,l,l)} n even 
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Moreover, for each lattice point t = (r, do,di) in Table [TJ one can verify that t satisfies the defining 
inequalities of \£ n : 

(r, d , di). (-1,2,1) >n 
(r, d , di) • (1,0,0) >1 
(r,d ,di)-(l,l-n,2-n) > 2n - n 2 
(r, d , di)- (0,0,1) >0 



and for n even we additionally have 

(r,d ,di)- (0,1,1) > 



n + 2 
2 



This gives P n C and we have equality. □ 
In the sequel, we will primarily focus on the rational tetrahedron 

A n :=conv{(l,^±l,0),(l,n-l,0),(l,l,n-l),(n,n,0)} 

regardless of whether n is even or odd. There are many reasons for this. First, asymptotically, there 
is no difference between P n and A n for ti even. The normal fan J\f(^P n ^j is obtained from J\f(A n ^) 
by adding a single ray and subdividing the full dimensional cone a = ((1, 0, 0), (0, 0, 1), (—1,2, 1)) 
corresponding to the vertex (1,^1,0). Thus, when n is even, the parameters giving (1,1, n — 
l),(l,n — 1,0), or (n,n, 0) the minimal energy are the same regardless of whether we use the 
subdivision of R 3 determined by M(P n ) or that determined by Af(A n ). Moreover, the parameters 
in a will yield (1, ^p,0), (1, ^, 1), or (2, ^^^,0) as minimal, and the trees corresponding to these 



three count vectors are all similar, as discussed in Proposition 3.3.1 



3.2 Lattice points in dP n 

Suppose S is a secondary structure whose plane tree has count vector (r, do, di). If (r, do,di) £ intP„ 
then there is no choice of parameters that can make S have minimal free energy. Conversely, if 
(r, do, di) £ int-F for some face F of P n , then any parameter vector in the cone of C N{P n ) yields 
S with minimal energy. We thus want to determine the count vectors lying on dP n . 

All four sets of inequalities in Table [I] intersect dP n . Let Qa, Qb, Qci and Qd be the polyhedra 
described in Table [T}^, [Tj3, |l]C, and|Ip, respectively. Then, Qa,Qb,Qc C dP n and 

Qa ={(1,1,71-1)} 
{QaUQb) HZ 3 =conv{(l,n- l,0),(l,l,n- l),(l,^±l,0)}nZ 3 
(QaUQc) HZ 3 = conv{(l,l,n- 1), (n, n, 0)} n Z 3 . 

Since Qd is 3-dimensional, it cannot be contained in the boundary of P n . We do, however, have 

{Q D n dP n ) HZ 3 = (mtE 1 U intFi U mtF 2 ) n Z 3 , (6) 

where E x = conv{(n, n, 0), (1, ^±1, 0)}, F x = conv{(n, n, 0), (1, 0), (1, 1, n - 1)}, and F 2 = 
conv{(n, n, 0), (1, 0), (1, n — 1,0)}. Equation ([6| follows from counting lattice points in the 
objects on the left and right hand sides of the equation using the same technique as in Proposition 
3.2.1 The plane trees defined in Table|Ip that lie on dP n satisfy di = or r = 2do — n + d\. Their 
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associated secondary structures either have no bulges/internal loops or have a maximal number of 
helices in the exterior loop. 

Next, we count the number of lattice points in the interior of each face of P n . For an edge of 
the form E = conv{(xi, y\, z\), (x2, y2, -22)}, we use the formula 



# (int-ETl I?) = gcd(|xi - x 2 \,\yi - 2/2 1 , \zi 



Z2 



1 



and obtain the following counts. The edges conv{(n, n, 0), (1, ^^^p, 0)} and conv{(l, 1, n— 1), (1, ^njp, 0)} 
each have h(n — 3) lattice points in their interiors. A total of |(n — 5) lattice points are in the 
interior of conv{(l, n— 1,0), (1, 0)}. The interior of conv{(n, n, 0), (1, 1, n — 1)} contains n — 2 
lattice points, and there are no interior lattice points for the edges conv{(n, n, 0), (1, n — 1, 0)} and 
conv{(l,l,n- l),(l,n- 1,0)}. 

To determine the number of lattice points in a facet F of P n , we use Pick's theorem [15] 

# (intF n 1?) = Area(F) - - [# (dF n Z 3 )] + 1, 

where the area of F is normalized with respect to the 2-dimensional sublattice containing F. We 
illustrate Pick's theorem with the following proposition. 

Proposition 3.2.1. There are no interior lattice points in the facet 

F = conv{(l,l,n- 1), (1, n - 1, 0), (n, n, 0)}. 



Proof. The triangle F lies on the hyperplane —X + (n — 1)1" + (n — 2)Z 



n 



In in R 3 , and 



thus we normalize the area of F by dividing by \/(— l) 2 + (n — l) 2 + (n — 2) 2 = y / 2(n 2 — 3n + 3). 
Before normalization, the area of .F is 



1 1 n 
1 n — 1 n 
1 1 1 



\ 

- ^2n 4 - 10n 3 + 20n - 18n + 6 
-(n - l)-v/2(" 2 -3n + 3). 





1 


n — 1 


n 


2 


n — 1 








+ 


n — 1 








+ 


1 


1 


n 




1 


1 


1 




1 


1 


1 



Moreover, using the counts above for the interior lattice points in the edges of F, we have 

# (dF n Z 3 ) = (n - 2) + + + 3 = n + 1. 

Applying Pick's theorem yields 

#(intFnZ 3 ) = !(„_!)_ !(„ + !) + 1 
= 0. 



□ 



For the other three facets of P n , each contains |(n — 3) 2 interior lattice points. In total, this 
gives j(3n 2 — 8n + 13) lattice points on dP n , all of which correspond to plane trees. 
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3.3 Biological meaning of P n and M(P n ) 
3.3.1 The vertices of P n 

The vertices of P n represent the secondary structures with the maximum number of helices in a 
loop — so-called "maximal degree of branching" — and the fewest helices in a loop — or "minimal 
degree of branching" — as described below. 

If T is a plane tree represented as a vertex of P n then T has n edges and n + 1 vertices. If in 
addition, T has count vector (ra, n, 0) then the degree of the root vertex is n and the n + 1 vertices 
are the root together with the n leaves (vertices with children). Thus, a secondary structure 
corresponding to T has no internal loops, bulges, or multi-branch loops and the exterior loop has 
n helices. 

If T has count vector (1, l,n — 1), the root vertex has degree 1, there is one leaf, and n — 1 
vertices of degree 2 (1 child). Thus, T is a straight line, and a secondary structure corresponding 
to T has no multi-branch loops and the exterior loop has one helix.. 

If T has count vector (1, n — 1, 0), the n + 1 vertices are the root (with degree 1), n — 1 leaves, 
and one vertex of degree n. Secondary structures corresponding to T have no internal loops or 
bulges and one multi-branch loop with n helices. In addition, the exterior loop has one helix. 

The remaining vertices — (1, ^,0) for n odd and (1, §, 1), (2, ^,0), or (1, ^,0) for n even- 
are dealt with in the following proposition. 

Proposition 3.3.1. (i) For n odd, any plane tree with count vector (1, 0) satisfies c?2 = 
and di = for i > 2. 

(ii) For n even, any plane tree with count vector (1,^,1) or (2,^^,0) satisfies cfe = and 
di = for i > 2. 

(Hi) For n even, any plane tree with count vector (1, *njr^, 0) satisfies cfo = d's = 1, and di = 
for i > 3. 

Proof. For suppose n is odd and T is a plane tree with n edges, r = 1, do = and d\ = 0. 
Then, T has ^p- + 1 vertices of degree 1, and the remaining n + 1 — (^^-^ + l) = vertices have 
degree at least 3. Thus, 

degv = |(n + 1) + 1 + degf 

ueV degri > 3 

>I(n+l) + l + |(n-l) 
= 2n 

However, since degu = 2\E\, we must have equality. Thus, all other vertices must have degree 

3 (2 children). 

The proof of (|n]) is nearly identical to that of 



For (iii), a plane tree with n edges, r = 1, do = '°4r L and d\ = has ^t- vertices of degree 1 



and zero vertices of degree 2. Such a tree cannot have all other vertices of degree 3 as this would 
yield a graph with an odd number of odd vertices. Thus, there is a vertex vq with degree p with 
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p > 4 even. This gives 



degf = |(n + 4) + p + deg 

«6V degt; > 3 

> i(n + 4) + 4 + |(n-4) 
= 2n 



As before this inequality must be an equality, and hence p = 4 and all other vertices have degree 
3. □ 
Thus, for n odd, the count vector (1, 0) corresponds to secondary structures with no interior 
loops/bulges, all multi-branch loops have 3 helices, and the exterior loop has one helix. When n 
is even, a secondary structure with n helices and all three of these properties is not possible. We 
instead have three cases, each with exactly one of the properties relaxed: a structure corresponding 
to (1, |, 1) has one interior loop/bulge, the count vector (1, ^^,0) arises from structures having 
one multi-branch loop with 4 helices (all other multi-branch loops have 3 helices), and the exterior 
loop of a structure corresponding to (2, 0) has 2 helices. For n odd, plane trees representative 
of those described in this section are shown in Figure [3} 




Remark. The map from plane trees to count vectors is generically many-to-one. Three of the 4 
vertices, however, correspond to exactly one tree: (n, n, 0), (1, 1, n — 1), (1, n — 1,0). The trees with 
count vector (1, 0) are in one-to-one correspondence with full binary trees with n — 1 edges (by 
removing the root vertex). There are Cn-i such trees J2J/, where Cn-i is the ^Y^th Catalan number 
defined in equation ff2§. 
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3.3.2 The rays in Af(P n ) 

The energy function E 1 in equation (j3j scores a secondary structure with n helices based on the 
number of helices in the exterior loop, the number hairpin loops, and the number of bulges/internal 
loops. The normal fan N{P n ) of P n subdivides the (#2, #3, #4) parameter space. Each vector in 
(x, y, z) G W\Q<2\ x M[#3] x M[04] corresponds to a scoring function in which x gives the weight of 
a helix in the external loop, y gives the weight of a hairpin loop, and z gives the weight of a 
bulge/internal loop. 

The fan N{P n ) consists of cones generated by elements in the power set 



Thus, a parameter vector v G M\02\ x M\6^\ x M[0 4 ] has the form c\y\ + C22/2 + C32/3 with c%, C2, C3 > 
and 2/1,2/2, 2/3 G {(1, 0, 0), (0, 0, 1), (—1,2, 1), (1, 1 — n, 2 — n)}. A generic vector in M 3 lies in the 
interior of one of the 3-dimensional cones in M(P n ), and hence we give a brief interpretation of the 
parameter vectors with c, 7^ for i = 1, 2, 3. 

Scoring vectors in the interior of the cone ((0, 0, 1), (1,0, 0), (1, 1 — n, 2 — n)) penalize for hairpin 
loops and can independently penalize or reward for helices in the exterior loop and bulges/internal 
loops. If v G int((0, 0, 1), (1, 0, 0), (— 1, 2, 1)) then v gives a penalty for both hairpin loops and interior 
loops/bulges. Helices in the exterior loop can be beneficial or harmful with this scoring vector, and 
v can equally penalize helices in the exterior loop, hairpin loops, and internal loops/bulges. Scoring 
vectors in the interior of one of the two remaining cones can reward or penalize all three quantities. 
These are not independent, however. For instance, if v G int((l, 1 — n,2 — n), (0,0, 1), (—1,2, 1)) 
and hairpin loops are disadvantageous under u's scoring scheme then helices in the exterior loop 
are beneficial. If w G int((l, 0, 0), (1, 1 — n, 2 — n), (—1, 2, 1)) and w rewards hairpin loops then w 
rewards bulges/internal loops. Similarly, if w penalizes for bulges/internal loops then w penalizes 
for hairpin loops. Also, scoring vectors in the interior of the cone ((1, 1 — n, 2 — n), (0, 0, 1), (—1,2, 1)) 
can equally reward hairpin loops, internal loops/bulges, and helices in the exterior loop. 

3.4 Variation in the parameter space 

In this section, we add additional information to the parameters {#2, #3, #4} hi order to study the 
effect of varying the multi-branch loop parameters in the thermodynamic model of RNA folding. 
We obtain free energy parameters for plane trees using one of the four combinatorial sequences 
having the form 



In these sequences, the segments of the form Y 6 pair with the Z 6 segments while the X nucleotides 
remain unpaired, and moreover all the loops of a given type have the same free energy. We do not 
include the possibilities X = U and {Y, Z} = {C, G} or X = G and {Y, Z} = {A, U} because we 
want to prevent the G — U pairing. For a given sequence, we use both the current (version 3.0) 
|19j and previous (version 2.3) (28] thermodynamic parameters, determined by the Turner lab. The 
parameters 03,00, and a\ are based on experimental measurement and are listed in Table [H] The 
parameters 02 and C2 come from the multi-branch loop scoring function in equation ([!]), where the 
parameters a, b, and c in this function are not determined experimentally. If L is a multi-branch loop 



& ({(1, 0, 0), (0, 0, 1), (-1, 2, 1), (1, 1 - n, 2 - n)}) . 




and {Y, Z} = {C, G} 
and {Y, Z} = {A, U} " 
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with m single-stranded bases and 112 helices and L appears in a secondary structure for one of our 
4 combinatorial sequences, then we have n\ = 4rj2- Additionally, for each helix in L, the single-base 
stacking energy is 03. Thus, free energy of L in equation ([I]) becomes E(L) = a + 4bri2 + cri2 + 03712, 
and the parameters 02 and C2 in the free energy function E' in equation ^ can be written as 
02 = 46 + c + 03 and C2 = a. Table III illustrates three types of variation: variation of specific 



n Turner 3.0 Values 

sequence 

a 3 a ai 



X=A, Y=G, Z=C 
X=A, Y=C, Z=G 
X=C, Y=A, Z=U 
X=C, Y=U, Z=A 



-1.9 4.1 2.3 

-1.6 4.5 2.3 

-0.4 5.0 3.7 

-0.6 4.9 3.7 



Turner 2.3 Values 

03 ao 01 
-1.9 3^5 3TF 
-1.6 3.8 3.0 
-0.4 4.3 4.0 
-0.6 4.2 4.0 



Table II: Energy parameters for plane trees 



nucleotides in combinatorial sequence, variation of the version of Turner's energy parameters, and 
variation of a, b, c parameters. The effect of varying the multi-branch loop parameters a, 6, c is more 
or less the same for each sequence and energy table: two different count vectors can be minimal 
depending on the value of a + 126 + 3c. Technically, a third vertex of P n has minimal energy in some 
cases when b = c = 0. However, if the offset and helix penalties are both zero, the multi-branch 
energy function will have no penalties for the number of single-stranded bases and the number of 
stems in a loop. This does not agree with the free energy model. 

Varying the sequence alone, we obtain differences in the cut-off values for a + 126 + 3c. On 
the whole, however, nucleotide variation in the combinatorial sequence does not give qualitative 
differences in the minimal energy plane trees. 

We do see (in 3 of the 4 sequences) qualitative differences in the minimal energy trees when 
we compare version 3.0 parameters to version 2.3 parameters. For instance, when a + 126 + 3c 
is large, 3 of the 4 sequences give the 'straight line' tree with count vector (1, l,n — 1) minimal 
with version 3.0 parameters. Using version 2.3, all four sequences result in the maximal degree of 
branching, with count vector (n, n, 0) having minimal energy. This difference in minimal energy 
trees is not too surprising because the change from version 2.3 to version 3.0 was based on more 
accurate experimental measurement. The secondary predicted structures have indeed changed. It 
is worth noting that if we use the actual penalties for offset, free base, and helix from versions 2.3 
and 3.0 of the Turner energies, we obtain 



a + 126 + 3c: 



4.6 v3.0 

9.7 v2.3. 



Thus, all four combinatorial sequences yield (n, n, 0) with minimal energy for version 2.3. Moreover, 
as 9.7 is a fair amount greater than the cut off for all four sequences, slight variation in these 
parameters will not change the predicted structure. For version 3.0, the two combinatorial sequences 
with unpaired poly-A segments have (1, 0) being minimal while (1, 1, n — 1) is minimal for the 
other two sequences. Also, 4.6 is much closer to the cut off values for the sequences. Small changes 
in these parameters could change which trees have minimal energy. 
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Vertex Rays in M{P n ) Energy version 



Restrictions on a, 6, c 



Sequence: 
[X,Y,Z] = 



(1,1 -n, 2 - n) 
(l.l.n-l) (1,0,0) 
(-1,2,1) 



(1, 2*1,0) 



(n,n,0) 



(0,0,1) 
(1,0,0) 
(-1,2,1) 



(1,1 -n, 2 -n) 
(0,0,0) 
(-1,2,1) 



(0,0,1) 

(l,n-l,0) (1,0,0) 

(1, 1 - n, 2 - n) 



3.0 (2.3) a + 126 + 3c > < 



3.0 (2.3) 



3.0 (2.3) 



3.0 (2.3) 



'N/A (N/A) 

4.9 (N/A) 

3.6 (N/A) 

[ 4.3 (N/A) 



a + 126 + 3c < < 



'6.0 (5.4) 

4.9 (5.4) 

3.6 (4.7) 

^4.3 (4.8) 



a + 126 + 3c > I 



' 6.0 (5.4) 

N/A (5.4) 

N/A (4.7) 

[N/A (4.8) 



6 = c = 0, a = < 



' 6.0 (5.4) 

N/A (5.4) 

N/A (4.7) 

[N/A (4.8) 



[A,G,C] 
[A,C,G] 
[C,A,U] 
[C,U,A] 

[A,G,C] 
[A,C,G] 
[C,A,U] 
[C,U,A] 

[A,G,C] 
[A,C,G] 
[C,A,U] 
[C,U,A] 

[A,G,C] 
[A,C,G] 
[C,A,U] 
[C,U,A] 



Table III: Restrictions on a, 6, c parameters from full-dimensional cones in M{P n ) 



3.5 RNA STRAND database analysis 
3.5.1 Overall shape of data 

Our initial collection of secondary structures contains 145 structures from 137 distinct RNA se- 
quences, as described in Materials and Methods. The sequences range from 19 to 4216 nucleotides. 
We exclude structures for which the number of helices is less than 5 from further analysis. This 
reason for this is that not all the vertices of P n listed in Proposition |3.1.1 are valid and distinct 



when n < 4. We have 110 structures with n > 5 (from 103 sequences) having average (median) 
length of 739 (367) and n = 27 (13). We break these into classes, based on the number of helices, 
as depicted in Table |TV) 

While our collection contains more small and medium trees as compared to large trees, this 
reflects the frequency in the RNA STRAND database. For instance, according to an analysis done 
by RNA STRAND, the average (median) number of helices over the entire database is 28 (8). 
This count does, however, include the sequences with fewer than 5 helices and includes a less 
restrictive definition of bulges/internal loops and helices: internal loops/bulges can have any number 
of unpaired bases and helices can have any number of base pairs. Our large trees come from 16S 
ribosomal RNA and 23S ribosomal RNA sequences and have a minimum sequence length of 954 
nucleotides. In the RNA STRAND database, only 20% of the 4666 structures contain at least 954 
nucleotides. 
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Category 


Ranj 


re of n 


# of trees 


Average length 


Median length 


Average n 


Median n 


Small 


5 - 


- 12 


50 


244 


220 


9 


9 


Medium 


13 


-40 


40 


676 


512 


22 


19 


Large 


41 - 


- 136 


20 


2104 


1831 


82 


76 



Table IV: Trees in RNA STRAND collection by size 



3.5.2 Location of count vectors on polytope 

It is of great importance to know when biologically correct secondary structures can be predicted 
by the free energy model. With our simplified energy function E' in ([3]), we ask if the biologically 
correct structures can be minimal for some choice of parameters. As mentioned in §3.2| this 
translates into determining when the corresponding count vectors lie on the boundary of P n . 

Seventy-one out of 110 count vectors lie on the boundary of P n : 49 lying on the interior of 
a facet, 18 lying on the interior of an edge, and 4 occurring as vertices. The average number of 
edges for plane trees on the boundary of P n is 17 and is 45 for plane trees in the interior of P n . 
Of those contained in the interior of a facet, 28 are minimal for parameters in ((—1,2,1)), 7 are 
minimal for parameters in ((0,0,1)), and 14 are minimal for parameter values in ((1,0,0)). Of 
those contained in the interior of an edge, 8 are minimal for parameters in ((— 1, 2, 1), (1, 0, 0)), 
9 are minimal for parameters in ((—1,2,1,(1,1 — n, 2 — n)), and 1 is minimal for parameters in 
((1, 0, 0), (0, 0, 1)). The 4 count vectors that are vertices of P n satisfy n = 5 or 6 and consist of the 
set {(5, 5,0), (6, 6,0), (1,4,0), (1,1, 4)}. Figure [3] shows the beat ion of the count vectors for small, 
medium, and large trees, given in terms of the percentage of trees in each category. 




facet edge vertex interior 



Figure 4: Location of count vectors on P n for small, medium, and large trees (in percentage) 
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3.5.3 Closest vertex to count vectors 



In order to determine which of the 4 vertices of P n is closest to a given count vector, we map the 
tetrahedron 

conv{(l,n,n,0),(l,l,l,n-l),(l,l,n- 1,0), (1,1,^,0)} 

onto the standard tetrahedron with vertices {(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}. This is 
accomplished with the following matrix 



1 



1 



n — 1 


n 

n — 3 



n — 1 


1 

n — 3 






2 




1 



2n(n - 2) 2 
(n- l)(n-3) (n-l)(n-3) 



n — 3 



7? 



n — 1 
1 

n — 3 



2(n 



(n-l)(n-3) 



(7) 



which has determinant 
sum 



(n-l) 2 (n-3) ' 
ai(n,n,0) + a 2 (l,l,n 



For a given n, any count vector (r, do,di) can be written as a 



l)+a 3 (l,n-l,0)+a 4 (l,^,0) 

with < ai < 1 and 01 + 02 + 03 + 04 = 1. After applying the linear transformation Q, the lattice 
point (l,r, do,di) will have coordinates (01,02,03,04). The coordinate Oj gives a measure of the 
'closeness' to vertex i. For a given RNA structure, the largest of the Oj gives the vertex closest to 
its count vector. Moreover, if t = max{ai, 02, 03, 04} then 0.25 < t < 1. 

Fifty-two of the 110 structures are closest to (1, ^p,0), 38 are closest to (1,1, n — 1), 10 are 
closest to (n, n, 0), and 6 are closest to (1, n — 1, 0). Additionally, we have 2 that are closest to both 
(1, l,n — l) and (n, n, 0) and 2 that are closest to both (1, 1, n — 1) and (1, 0). The average 
values of (01,02,03,04) over the 110 structures are (0.181,0.357,0.138,0.332) which shows that as 
a whole, the count vectors are closest to (1, 1, n — 1) and (1, ^^p, 0). 

We say a count vector is 'close' to vertex i if Oj > 0.625. The value 0.625 is halfway in between 
the smallest and largest possible values of Oj. With this definition, 22% of the small trees are 
close to vertices, 5% of the medium trees are close to vertices, and no large trees are close to 
vertices. Thirteen trees in total are close to vertices, of which 8 are close to (1, 1, n — 1), 2 are close 
to (1, ^^,0), 2 are close to (n,n, 0), and 1 is close to (l,n — 1,0). All thirteen of these lattice 
points lie on the boundary of P n and hence correspond to minimal energy trees for some choice of 
parameter values. 



4 Discussion and Conclusions 

We have used a simple scoring scheme for scoring RNA folds: energy is assigned to a secondary 
structure based solely on the total number of helices, the number of helices in the exterior loop, 
and the numbers of hairpin loops and bulges/internal loops. Fixing the total number of helices, 
the extremal folds are those with the maximal and minimal degrees of branching. When a generic 
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parameter vector is chosen, precisely one of those will have minimal energy. For more specific 
choices of parameters (biologically realistic or not), the number of minimal count vectors is on 
the order of the square of the total number of helices. While this seems large, the total number of 
count vectors that cannot be minimal for any choice of parameters is on the order of the cube of the 
total number of helices. Thus, when this total is large, we would not expect such a scoring scheme 
to accurately predict the correct structures. This is supported by our RNA STRAND analysis in 
which 85% of the count vectors from known structures with a high number of helices cannot be 
minimal for any choice of parameters. None of these structures are 'close' to the extremal folds. 
This is not unexpected, however, since even the highly detailed free energy model is not accurate 
for large RNA molecules [9]. 

On the other hand, when the total number of helices is small, only 10% of the known structures 
cannot be minimal for our scoring scheme. While the scoring function used in this work is too 
simplistic to implement in a prediction software, our results suggest that for small RNA molecules, 
the full free energy model is not necessary for accurate predictions. We are not the first to make 
this observation, for [10] analyzed some simple probabilistic RNA folding models — one with as few 
21 free parameters — whose accuracies are comparable to mfold's. In their study, the sequences used 
for testing came from ribonuclease P RNA, transfer mRNA, and signal recognition particle RNA 
sequences, all of which yield small to medium trees by our classification. While 21 parameters is far 
too many for parametric analysis using polyhedral geometry, perhaps a simple model incorporating 
some thermodynamics and some probabilistic parameters can accurately predict the folding of small 
RNA molecules. 

We compared the variation of multi- branch loop parameters to two other types of variation in the 
parameter space. Fixing the combinatorial sequence and energy version, two possible count vectors 
can be minimal by varying the multi-branch loops parameters. If we use the most recent (accurate) 
energy version, we find that for 3 of the 4 sequences, these two count vectors include (1, 1, n — 1) 
and (1, ^-^,0). Interestingly, these two vertices are closest to the known structures in our RNA 
STRAND collection. Moreover, regardless of the choices of multi-branch loop parameters in the 
current version of the thermodynamic model, predicted structures have a low degree of branching — 
both in the exterior loop and in the multi-branch loops. Out of the three possible variations, the 
most significant changes come from varying the energy version, as the possible predicted structure 
for version 2.3 have a high degree of branching. Even though the penalties for off-set, free base 
and helix in the multi-branch loop energy calculation are chosen without specific measurement, 
they do not appear to have a dramatic effect on the predicted structures. One would hope that 
the parameters determined experimentally are what truly govern the predicted structures, and our 
findings support this possibility. 

5 Materials and Methods 

5.1 Selection of secondary structures from RNA STRAND database 

The RNA STRAND database pQ was searched by type of RNA (for example, 16S ribosomal RNA, 
cis-regulatory element, or group I intron). Each type of RNA was sorted by molecule length, and 
structures were selected from a variety of organisms to be representative of the different lengths 
appearing in the database for that type of RNA. Visual inspection of the secondary structures was 
important in the selection of the structures for our collection. It allowed for the inclusion of similar 
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length structures with different types of branching. It also prevented our collection from containing 
nearly identical structures formed by two different RNA molecules of the same type. Finally, visual 
inspection kept our collection from having a plethora of structures with only one or two helices; 
these structures are overrepresented in the RNA STRAND database. 

5.2 Removal of pseudoknots from .ct files 

In order to obtain a plane tree from a give secondary structure, pseudoknots were removed. A perl 
script read the .ct file and stored the closing pairs of all helices, where the helices are defined is §5.3| 
Each pair and (i',f) of closing pairs was tested to see if i < i' < j < f. If true, the pairs (i, j) 
and («', j') were printed to a file. Next, for each pair (i, j) and in the output file, one of the 

associated helices was removed according to the following rubric. If some closing pair appears 
multiple times, its helix was removed under the assumption that it formed a pseudoknot. If both 
(i, j) and (i',f) were not listed with any other closing pairs, the shorter of the 2 corresponding 
helices was removed. In the event that the two helices had the same number of paired bases, two 
versions of the .ct file were saved — one with the first helix removed and one with the second helix 
removed. 

5.3 Calculation of n,r,do,di from .ct files 

After all the pseudoknots were removed from the .ct files of secondary structures in our collection, 
a perl script calculated n, r, do, and d\. In our simplified model of RNA folding, all helices have the 
same energy independent of the number of base pairs in the helix. Similarly, all bulges/internal 
loops have the same energy regardless of the number of free bases in the loop. Because of this, very 
small bulges/internal loops and very short helices were ignored. Bulges and interior loops were 
required to have at least 3 unpaired bases. No restrictions were placed on the number of free bases 
in a hairpin loop, which was important so as to maintain the graph structure (edges connecting 
two vertices). 



Figure 5: Helices and internal loops: A, B, and C are fragments from structures in the RNA 
STRAND database 

Each helix with choice of closing pair has a 'left length' and 'right length' of the helix. The left 
length of a helix is the number of bases in the portion of the sequence that terminates at one of 



G 



c 3' 



A) 




16 



the closing bases. The right length of a helix is the number of bases in the portion of the sequence 
that originates at one of the closing bases. The closing pair of a helix as well as its right length 
are depicted in Figure [5]^. For this structure, the helix with closing pair G-C has left length 28 
and right length 25. For our analyses, a helix was defined to have both the left and right length 3 
or greater. Thus, the piece of secondary structure shown in Figure [5)3 has two helices — one with 
left and right length 5 and one with left and right length 3 — and one hairpin loop. Similarly, with 
our definitions, the fragment depicted in Figure [5p has only 1 interior loop that contains the base 
pairs G-C and U-A. The single C-G base pair is not considered a helix, and since each of the 
internal loops containing the C-G pair have more than 3 unpaired bases, the C-G base pair is not 
considered a part of either helix. 
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